

TFIDF In Libraries: Part I (For Librarians)

This is the first of a three-part series called TFIDF In Libraries, where "relevancy ranking" will be introduced. In this part, term frequency/inverse document frequency (TFIDF) -- a common mathematical method of weighting texts for automatic classification and sorting search results -- will be described. Part II will illustrate an automatic classification system and simple search engine using TFIDF through a computer program written in Perl. Part III will explore the possibility of filtering search results by applying TFIDF against sets of pre-defined "Big Names" and/or "Big Ideas" -- an idea apparently called "champion lists". 


The problem, straight Boolean logic

To many of us the phrase "relevancy ranked search results" is a mystery. What does it mean to be "relevant"? How can anybody determine relevance for me? Well, a better phrase might have been "statistically significant search results". Taking such an approach -- the application of statistical analysis against texts -- does have its information retrieval advantages over straight Boolean logic. Take, for example, the following three documents, each consisting of a number of words, Table #1:

	Document #1  Document #2  Document #3
	Word         Word         Word
	-----------  -----------  -----------
	airplane     book         building
	blue         car          car
	chair        chair        carpet
	computer     justice      ceiling
	forest       milton       chair
	justice      newton       cleaning
	love         pond         justice
	might        rose         libraries
	perl         shakespeare  newton
	rose         slavery      perl
	shoe         thesis       rose
	thesis       truck        science

A search for "rose" against the corpus will return three hits, but which one should I start reading? The newest document? The document by a particular author or in a particular format? Even if the corpus contained 2,000,000 documents and a search for "rose" returned a mere 100, the problem would remain. Which ones should I spend my valuable time accessing? Yes, I could limit my search in any number of ways, but unless I am doing a known-item search it is quite likely the search results will return more than I can use, and information literacy skills will only go so far. Ranked search results -- a list of hits based on term weighting -- have proven to be an effective way of addressing this problem. All it requires is the application of basic arithmetic against the documents being searched.


Simple counting

We can begin by counting the number of times each of the words appears in each of the documents, Table #2:

	Document #1     Document #2     Document #3
	Word         C  Word         C  Word         C
	--------------  --------------  --------------
	airplane     5  book         3  building     6
	blue         1  car          7  car          1
	chair        7  chair        4  carpet       3
	computer     3  justice      2  ceiling      4
	forest       2  milton       6  chair        6
	justice      7  newton       3  cleaning     4
	love         2  pond         2  justice      8
	might        2  rose         5  libraries    2
	perl         5  shakespeare  4  newton       2
	rose         6  slavery      2  perl         5
	shoe         4  thesis       2  rose         7
	thesis       2  truck        1  science      1
	
	Totals (T)  46              41              49

Given this simple counting method, search results for "rose" can be sorted by "term frequency" (TF) -- the quotient of the number of times a word appears in a given document (C) and the total number of words in that document (T) -- TF = C / T. In the first case, "rose" has a TF value of 0.13. In the second case TF is 0.12, and in the third case it is 0.14. Thus, by this rudimentary analysis, Document #3 is the most significant in terms of the word "rose", and Document #2 is the least; Document #3 has the highest percentage of content containing the word "rose".
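Part II will implement all of this in Perl; in the meantime, the arithmetic can be sketched in a few lines of Python. The counts below are copied from Table #2, and the function name (tf) is merely illustrative:

```python
# Term frequency (TF) is the count of a word in a document (C)
# divided by the total number of words in that document (T).
def tf(word, counts):
    return counts[word] / sum(counts.values())

# Word counts from Table #2; totals are 46, 41, and 49 respectively.
doc1 = {"airplane": 5, "blue": 1, "chair": 7, "computer": 3, "forest": 2,
        "justice": 7, "love": 2, "might": 2, "perl": 5, "rose": 6,
        "shoe": 4, "thesis": 2}
doc2 = {"book": 3, "car": 7, "chair": 4, "justice": 2, "milton": 6,
        "newton": 3, "pond": 2, "rose": 5, "shakespeare": 4,
        "slavery": 2, "thesis": 2, "truck": 1}
doc3 = {"building": 6, "car": 1, "carpet": 3, "ceiling": 4, "chair": 6,
        "cleaning": 4, "justice": 8, "libraries": 2, "newton": 2,
        "perl": 5, "rose": 7, "science": 1}

for number, doc in enumerate([doc1, doc2, doc3], start=1):
    print(f"Document #{number}: TF(rose) = {tf('rose', doc):.2f}")
# Document #1: TF(rose) = 0.13
# Document #2: TF(rose) = 0.12
# Document #3: TF(rose) = 0.14
```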


Accounting for common words

Unfortunately, this simple analysis needs to be offset to account for terms occurring frequently across the entire corpus. Good examples are stop words or the word "human" in MEDLINE. Such words are nearly meaningless because they appear so often. Consider Table #3, which includes the number of documents in the corpus containing each word (DF), and the quotient of the total number of documents (D, or in this case, 3) and DF -- IDF = D / DF. Words with higher scores are more significant across the entire corpus. Search terms whose IDF ("inverse document frequency") score approaches 1 are close to useless because they exist in just about every document:

	Document #1            Document #2           Document #3
	Word         DF  IDF   Word         DF  IDF  Word        DF  IDF
	--------------------   --------------------  -------------------
	airplane     1   3.0   book         1   3.0  building    1   3.0
	blue         1   3.0   car          2   1.5  car         2   1.5
	chair        3   1.0   chair        3   1.0  carpet      1   3.0
	computer     1   3.0   justice      3   1.0  ceiling     1   3.0
	forest       1   3.0   milton       1   3.0  chair       3   1.0
	justice      3   1.0   newton       2   1.5  cleaning    1   3.0
	love         1   3.0   pond         1   3.0  justice     3   1.0
	might        1   3.0   rose         3   1.0  libraries   1   3.0
	perl         2   1.5   shakespeare  1   3.0  newton      2   1.5
	rose         3   1.0   slavery      1   3.0  perl        2   1.5
	shoe         1   3.0   thesis       2   1.5  rose        3   1.0
	thesis       2   1.5   truck        1   3.0  science     1   3.0
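The same calculation in Python only needs each document's vocabulary, not the counts. Again, this is a sketch, not the Part II Perl program:

```python
# Inverse document frequency (IDF) is the total number of documents (D)
# divided by the number of documents containing the word (DF).
def idf(word, docs):
    df = sum(1 for doc in docs if word in doc)
    return len(docs) / df

# The vocabularies of the three documents from Table #1.
docs = [
    {"airplane", "blue", "chair", "computer", "forest", "justice",
     "love", "might", "perl", "rose", "shoe", "thesis"},
    {"book", "car", "chair", "justice", "milton", "newton", "pond",
     "rose", "shakespeare", "slavery", "thesis", "truck"},
    {"building", "car", "carpet", "ceiling", "chair", "cleaning",
     "justice", "libraries", "newton", "perl", "rose", "science"},
]

print(idf("rose", docs))    # in all three documents: 3 / 3 = 1.0
print(idf("perl", docs))    # in two documents:       3 / 2 = 1.5
print(idf("milton", docs))  # in one document:        3 / 1 = 3.0
```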


Term frequency/inverse document frequency (TFIDF)

By taking into account these two factors -- term frequency (TF) and inverse document frequency (IDF) -- it is possible to assign "weights" to search results and therefore order them statistically. Put another way, a search result's score ("ranking") is the product of TF and IDF:

	TFIDF = TF * IDF where:
	
	   * TF = C / T where C = number of times a given word appears in a document and T = total number of words in a document
		
	   * IDF = D / DF  where D = total number of documents in a corpus, and DF = total number of documents containing a given word

Table #4 is a combination of all the previous tables with the addition of the TFIDF score for each term:

	Document #1              
	Word         C  T   TF     D  DF  IDF  TFIDF
	--------------------------------------------
	airplane     5  46  0.109  3  1   3.0  0.326
	blue         1  46  0.022  3  1   3.0  0.065
	chair        7  46  0.152  3  3   1.0  0.152
	computer     3  46  0.065  3  1   3.0  0.196
	forest       2  46  0.043  3  1   3.0  0.130
	justice      7  46  0.152  3  3   1.0  0.152
	love         2  46  0.043  3  1   3.0  0.130
	might        2  46  0.043  3  1   3.0  0.130
	perl         5  46  0.109  3  2   1.5  0.163
	rose         6  46  0.130  3  3   1.0  0.130
	shoe         4  46  0.087  3  1   3.0  0.261
	thesis       2  46  0.043  3  2   1.5  0.065
	
	Document #2              
	Word         C  T   TF     D  DF  IDF  TFIDF
	--------------------------------------------
	book         3  41  0.073  3  1   3.0  0.220
	car          7  41  0.171  3  2   1.5  0.256
	chair        4  41  0.098  3  3   1.0  0.098
	justice      2  41  0.049  3  3   1.0  0.049
	milton       6  41  0.146  3  1   3.0  0.439
	newton       3  41  0.073  3  2   1.5  0.110
	pond         2  41  0.049  3  1   3.0  0.146
	rose         5  41  0.122  3  3   1.0  0.122
	shakespeare  4  41  0.098  3  1   3.0  0.293
	slavery      2  41  0.049  3  1   3.0  0.146
	thesis       2  41  0.049  3  2   1.5  0.073
	truck        1  41  0.024  3  1   3.0  0.073
	
	Document #3              
	Word         C  T   TF     D  DF  IDF  TFIDF
	--------------------------------------------
	building     6  49  0.122  3  1   3.0  0.367
	car          1  49  0.020  3  2   1.5  0.031
	carpet       3  49  0.061  3  1   3.0  0.184
	ceiling      4  49  0.082  3  1   3.0  0.245
	chair        6  49  0.122  3  3   1.0  0.122
	cleaning     4  49  0.082  3  1   3.0  0.245
	justice      8  49  0.163  3  3   1.0  0.163
	libraries    2  49  0.041  3  1   3.0  0.122
	newton       2  49  0.041  3  2   1.5  0.061
	perl         5  49  0.102  3  2   1.5  0.153
	rose         7  49  0.143  3  3   1.0  0.143
	science      1  49  0.020  3  1   3.0  0.061

Given TFIDF, a search for "rose" still returns three documents, now ordered Documents #3, #1, and #2. A search for "newton" returns only two items, ordered Documents #2 (0.110) and #3 (0.061). In the latter case, Document #2 is nearly twice as "relevant" as Document #3. TFIDF scores can be summed to take into account Boolean unions (or) or intersections (and).
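Putting the two pieces together yields a tiny search engine. The following Python sketch (the real thing, in Perl, comes in Part II) reproduces the rankings just described; the search function and its name are illustrative:

```python
# TFIDF = TF * IDF = (C / T) * (D / DF).
def tfidf(word, doc, docs):
    tf = doc[word] / sum(doc.values())          # C / T
    df = sum(1 for d in docs if word in d)      # DF
    return tf * (len(docs) / df)                # TF * (D / DF)

# Word counts from Table #2.
doc1 = {"airplane": 5, "blue": 1, "chair": 7, "computer": 3, "forest": 2,
        "justice": 7, "love": 2, "might": 2, "perl": 5, "rose": 6,
        "shoe": 4, "thesis": 2}
doc2 = {"book": 3, "car": 7, "chair": 4, "justice": 2, "milton": 6,
        "newton": 3, "pond": 2, "rose": 5, "shakespeare": 4,
        "slavery": 2, "thesis": 2, "truck": 1}
doc3 = {"building": 6, "car": 1, "carpet": 3, "ceiling": 4, "chair": 6,
        "cleaning": 4, "justice": 8, "libraries": 2, "newton": 2,
        "perl": 5, "rose": 7, "science": 1}
docs = [doc1, doc2, doc3]

# Return (document number, score) pairs for every document containing
# the word, sorted from highest score to lowest.
def search(word, docs):
    hits = [(number, tfidf(word, doc, docs))
            for number, doc in enumerate(docs, start=1) if word in doc]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

print(search("rose", docs))    # Documents #3, #1, and #2
print(search("newton", docs))  # Documents #2 and #3
```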


Automatic classification

TFIDF can also be applied a priori to indexing/searching to create browsable lists -- hence, automatic classification. Consider Table #5, where each document's words are listed in sorted TFIDF order:

	Document #1         Document #2         Document #3
	Word         TFIDF  Word         TFIDF  Word         TFIDF
	------------------  ------------------  ------------------
	airplane     0.326  milton       0.439  building     0.367
	shoe         0.261  shakespeare  0.293  ceiling      0.245
	computer     0.196  car          0.256  cleaning     0.245
	perl         0.163  book         0.220  carpet       0.184
	chair        0.152  pond         0.146  justice      0.163
	justice      0.152  slavery      0.146  perl         0.153
	forest       0.130  rose         0.122  rose         0.143
	love         0.130  newton       0.110  chair        0.122
	might        0.130  chair        0.098  libraries    0.122
	rose         0.130  thesis       0.073  newton       0.061
	blue         0.065  truck        0.073  science      0.061
	thesis       0.065  justice      0.049  car          0.031

Given such a list it would be possible to take the first three terms from each document and call them the most significant subject "tags". Thus, Document #1 is about airplanes, shoes, and computers. Document #2 is about Milton, Shakespeare, and cars. Document #3 is about buildings, ceilings, and cleaning.

Probably a better way to assign "aboutness" to each document is to first define a TFIDF lower bound and then assign to each document the terms whose scores exceed it. Assuming a lower bound of 0.2, Document #1 is about airplanes and shoes. Document #2 is about Milton, Shakespeare, cars, and books. Document #3 is about buildings, ceilings, and cleaning.
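The lower-bound approach can be sketched in Python as well. The threshold of 0.2 is the same arbitrary value used above, and the tags function is merely illustrative:

```python
# TFIDF = (C / T) * (D / DF), as defined earlier.
def tfidf(word, doc, docs):
    tf = doc[word] / sum(doc.values())
    df = sum(1 for d in docs if word in d)
    return tf * (len(docs) / df)

# Word counts from Table #2.
doc1 = {"airplane": 5, "blue": 1, "chair": 7, "computer": 3, "forest": 2,
        "justice": 7, "love": 2, "might": 2, "perl": 5, "rose": 6,
        "shoe": 4, "thesis": 2}
doc2 = {"book": 3, "car": 7, "chair": 4, "justice": 2, "milton": 6,
        "newton": 3, "pond": 2, "rose": 5, "shakespeare": 4,
        "slavery": 2, "thesis": 2, "truck": 1}
doc3 = {"building": 6, "car": 1, "carpet": 3, "ceiling": 4, "chair": 6,
        "cleaning": 4, "justice": 8, "libraries": 2, "newton": 2,
        "perl": 5, "rose": 7, "science": 1}
docs = [doc1, doc2, doc3]

# Assign "aboutness": keep only the terms scoring above a lower
# bound, listed in descending TFIDF order.
def tags(doc, docs, bound=0.2):
    keep = [word for word in doc if tfidf(word, doc, docs) > bound]
    return sorted(keep, key=lambda word: tfidf(word, doc, docs),
                  reverse=True)

for number, doc in enumerate(docs, start=1):
    print(f"Document #{number} is about:", ", ".join(tags(doc, docs)))
# Document #1 is about: airplane, shoe
# Document #2 is about: milton, shakespeare, car, book
# Document #3 is about: building, ceiling, cleaning
```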


Discussion and conclusion

Since the beginning, librarianship has focused on the semantics of words in order to create a cosmos from an apparent chaos. "What is this work about? Read the descriptive information regarding a work (author, title, publisher, date, notes, etc.) to work out in your mind its importance." Unfortunately, this approach leaves much up to interpretation. One person says this document is about horses, and the next person says it is about husbandry.

The mathematical approach is more objective and much more scalable. While not perfect, there is much less interpretation required with TFIDF. It is just about mathematics. Moreover, it is language independent; it is possible to weight terms and provide relevance ranking without knowing the meaning of a single word in the index.

In actuality, the whole thing is not an either/or sort of question, but instead a both/and sort of question. Human interpretation provides an added value, definitely. At the same time the application of mathematics ("Can you say 'science?'") proves to be quite useful too. The approaches complement each other -- they are arscient. Much of how we have used computers in libraries has simply been to automate existing processes. We have yet to learn how to truly take advantage of a computer's functionality. It can remember things a whole lot better than we can. It can add a whole lot faster than we can. Because of this it is almost trivial to calculate ( C / T ) * ( D / DF ) over an entire corpus of 2,000,000 MARC records or even 1,000,000 full text documents.

None of these ideas are new. It is possible to read articles describing these techniques going back about 40 years. Why has our profession not used them to our advantage? Why is it taking us so long? If you have an answer, then enter it in the comment box below.

This first posting has focused on the fundamentals of TFIDF. Part II will describe a Perl program implementing relevancy ranking and automatic classification against sets of given text files. Part III will explore the idea of using TFIDF to enable users to find documents alluding to "great ideas" or "great people". 


-- 
Eric Lease Morgan
April 13, 2009